Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. Are they doing exactly like what they were told, or are they making some of the common mistakes? In this project, we aim to use data collected from a set of sensors to predict the manner someone performas some of the simple weight lifting exercise.
Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). 4 fifferent accelerometers have been placed on the participants themselves and on the dumbbell - 3 on the belt, forearm and arm, and 1 on dumbell. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
To solve this problem, we first applied Principal Componenet Analysis to narrow down the vairables from 160 to 25. We then applied two multi-claccification models - Random Forest and K-nearest Neighbours. We selected Random Forest due to higher accuracy.
training <- read.csv("pml-training.csv", na.strings = c("", "NA"))
testing <- read.csv("pml-testing.csv", na.strings = c("", "NA"))
# We hided results in the final output because it's too long
head(training,10)
dim(training)
## [1] 19622 160
Our obeservations: 1) the dataset is quite large. It also has a lot of variables. 2) Each sensor captures multiple datapoints. Some variables has a lot of NA values. They might not be useful and we should consider exlcude these variables. 3) The name of variable ends with the name of the particular sensor that is providing the datapoint. This allows us to group variables together and possibily conducts a PCA to reduce dementions.
# We hided results in the final output because it's too long
library(dplyr)
col_na_count <- training %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
col_na_count
Here we have count of NA in each column. As we can see, for columns with NA, NA is the dominated value - there are 19216 NAs out of 19622 rows. Let’s remove columns with more than half rows NA.
# We hided results in the final output because it's too long
training_filter <- training %>%
select(-which(col_na_count > 0.5 * nrow(training)))
head(training_filter)
Now we are down to 60 variables.
Among the predictors, we can see they can be grouped in to groups based on their names. Let’s group them and check the correlation among variables. First let’s take a look at Belt.
# We hided results in the final output because it's too long
training_belt <- training_filter %>% select(contains("belt"))
cor(training_belt)
head(training_belt)
As we can see, a lot of variables are highly correlated, such as -0.99. We can use Principle Componenet Analysis to reduce the dementions. Notice the scales of these variables are very different and there are both positive and negative values, we will need to center and scale our data as well.
library(caret)
training_belt_pcamodel <- preProcess(training_belt, method = c("center", "scale", "pca"), thresh = 0.90)
training_belt_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 4 components to capture 90 percent of the variance
As we can see, we only need 4 components do perserve 90% of the variance. This will decrease reduce our data dimentions dramatically.
# apply PCA model
training_belt_pca <- predict(training_belt_pcamodel, newdata = training_belt)
# rename the columne names
names(training_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")
Next we will repeat the process for the rest three group of variable - forearm, arm, and dumbell.
# Create model for forearm
training_forearm <- training_filter %>% select(contains("forearm"))
training_forearm_pcamodel <- preProcess(training_forearm, method = c("center", "scale", "pca"), thresh = 0.90)
training_forearm_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 8 components to capture 90 percent of the variance
# We need 8 variables to perserve 90% variance
training_forearm_pca <- predict(training_forearm_pcamodel, newdata = training_forearm)
names(training_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5","forearm_pca6","forearm_pca7","forearm_pca8")
# Create model for arm
training_arm <- training_filter %>% select(contains("_arm"))
training_arm_pcamodel <- preProcess(training_arm, method = c("center", "scale", "pca"), thresh = 0.90)
training_arm_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 7 components to capture 90 percent of the variance
# We need 7 variables to perserve 90% variance
training_arm_pca <- predict(training_arm_pcamodel, newdata = training_arm)
names(training_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5","arm_pca6","arm_pca7")
# Create model for dumbbell
training_dumbbell <- training_filter %>% select(contains("_dumbbell"))
training_dumbbell_pcamodel <- preProcess(training_dumbbell, method = c("center", "scale", "pca"), thresh = 0.90)
training_dumbbell_pcamodel
## Created from 19622 samples and 13 variables
##
## Pre-processing:
## - centered (13)
## - ignored (0)
## - principal component signal extraction (13)
## - scaled (13)
##
## PCA needed 6 components to capture 90 percent of the variance
# We need 6 variables to perserve 90% variance
training_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = training_dumbbell)
names(training_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5","dumbbell_pca6")
Now let’s combine all the variables after PCA.
training_pca <- data.frame(user_name = training[, 2], classe = training$classe, training_belt_pca, training_arm_pca, training_forearm_pca, training_dumbbell_pca )
Nowe we have 33 variables. Let’s do some Exploratory Data Analysis.
table(training_pca$user_name, training_pca$classe)
##
## A B C D E
## adelmo 1165 776 750 515 686
## carlitos 834 690 493 486 609
## charles 899 745 539 642 711
## eurico 865 592 489 582 542
## jeremy 1177 489 652 522 562
## pedro 640 505 499 469 497
Each of the 6 participants performed 5 classes of workout.
# plot classes vs users
library(plotly)
plot_ly(training_pca, x = ~classe, y = ~belt_pca1, color = ~user_name, type = "box") %>%
layout(title = "belt_pca1 for each class per user")
In this plot, we can see that user is differently a very important factor. No matter which classe is, the range of motion seems to be different from users to users. This tells us user should be a predictor included in our model.
plot_ly(subset(training_pca, user_name = "adelmo"), x = ~belt_pca1, y = ~forearm_pca1, z = ~dumbbell_pca1, color = ~classe) %>%
layout(title = "Adelmo performing different classes")
In this plot, we only look at Adelmo’s movements. We can see that: 1) each classes has different range of movements. 2) There might be some outliers in the range of movement. For example, one data point has extremely large dumbbell_pca1. However, Because we only plotted 3 vairables, it’s hard to see if these are real outliers, so we decided to keep them.
First, let’s slice the training data into training and validation sample.
# Define train control for k fold cross validation
# train_control <- trainControl(method="cv", number= 5, savePredictions = TRUE)
set.seed(123)
index <- createDataPartition(training$classe, p = 0.6, list = FALSE)
pca_tra <- training_pca[index,]
pca_val <- training_pca[-index,]
This is a mutiple calssification problem. The models we can use are tree based models and k-nearest neighbours.
#Random Forest
model1 <- train(classe ~., data= pca_tra, method="rf")
pca_val$pred_rf <- predict(model1, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2216 8 3 2 3
## B 56 1449 12 1 0
## C 3 32 1322 11 0
## D 6 0 62 1210 8
## E 0 0 11 7 1424
##
## Overall Statistics
##
## Accuracy : 0.9713
## 95% CI : (0.9674, 0.9749)
## No Information Rate : 0.2907
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9637
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9715 0.9731 0.9376 0.9829 0.9923
## Specificity 0.9971 0.9891 0.9929 0.9885 0.9972
## Pos Pred Value 0.9928 0.9545 0.9664 0.9409 0.9875
## Neg Pred Value 0.9884 0.9937 0.9864 0.9968 0.9983
## Prevalence 0.2907 0.1898 0.1797 0.1569 0.1829
## Detection Rate 0.2824 0.1847 0.1685 0.1542 0.1815
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9843 0.9811 0.9652 0.9857 0.9948
#KNN Model
ctrl <- trainControl(method="repeatedcv",repeats = 3)
model2 <- train(classe ~., data= pca_tra, method="knn", trControl = ctrl)
pca_val$pred_knn <- predict(model2, newdata = pca_val, type = "raw")
confusionMatrix(pca_val$classe, pca_val$pred_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2182 24 15 9 2
## B 81 1379 45 11 2
## C 17 30 1283 29 9
## D 5 4 68 1205 4
## E 2 16 15 20 1389
##
## Overall Statistics
##
## Accuracy : 0.948
## 95% CI : (0.9429, 0.9528)
## No Information Rate : 0.2915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9342
##
## Mcnemar's Test P-Value : 1.762e-12
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9541 0.9491 0.8997 0.9458 0.9879
## Specificity 0.9910 0.9783 0.9868 0.9877 0.9918
## Pos Pred Value 0.9776 0.9084 0.9379 0.9370 0.9632
## Neg Pred Value 0.9813 0.9883 0.9779 0.9895 0.9973
## Prevalence 0.2915 0.1852 0.1817 0.1624 0.1792
## Detection Rate 0.2781 0.1758 0.1635 0.1536 0.1770
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9725 0.9637 0.9432 0.9668 0.9898
Random Forest model has higher accuracy than the KNN model. It has 97% accuracy. We will use this model to predict the testing data.
col_na_count_testing <- testing %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
testing_filter <- testing %>%
select(-which(col_na_count_testing > 0.5 * nrow(testing)))
# Create pca variables with "belt"
testing_belt <- testing_filter %>% select(contains("belt"))
testing_belt_pca <- predict(training_belt_pcamodel, newdata = testing_belt)
names(testing_belt_pca) <- c("belt_pca1","belt_pca2","belt_pca3","belt_pca4")
# Create pca variables with "forearm"
testing_forearm <- testing_filter %>% select(contains("forearm"))
testing_forearm_pca <- predict(training_forearm_pcamodel, newdata = testing_forearm)
names(testing_forearm_pca) <- c("forearm_pca1","forearm_pca2","forearm_pca3","forearm_pca4", "forearm_pca5", "forearm_pca6", "forearm_pca7", "forearm_pca8")
# Create pca variables with "_arm"
testing_arm <- testing_filter %>% select(contains("_arm"))
testing_arm_pca <- predict(training_arm_pcamodel, newdata = testing_arm)
names(testing_arm_pca) <- c("arm_pca1","arm_pca2","arm_pca3","arm_pca4", "arm_pca5", "arm_pca6", "arm_pca7")
# Create pca variables with "dumbbell"
testing_dumbbell <- testing_filter %>% select(contains("dumbbell"))
testing_dumbbell_pca <- predict(training_dumbbell_pcamodel, newdata = testing_dumbbell)
names(testing_dumbbell_pca) <- c("dumbbell_pca1","dumbbell_pca2","dumbbell_pca3","dumbbell_pca4", "dumbbell_pca5", "dumbbell_pca6")
# Combine groups of categories
testing_pca <- data.frame(user_name = testing[, 2], testing_belt_pca, testing_arm_pca, testing_forearm_pca, testing_dumbbell_pca )
testing_pca$pred <- predict(model1, newdata = testing_pca, type = "raw")
testing_pca$pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E